1 Original Data Visualization in News Media
The geographical distribution of Singapore’s population is a focal point of urban studies and public policy discussions. A new visualization of demographic trends across planning areas offers a fresh perspective on this discourse. Our project aims to examine the relationship between demographic characteristics and urban planning policies. The patterns visible in Figure 1 do not appear random and may be consistent with the hypothesis that strategic urban development shapes how the population is distributed.
Our visualization, inspired by the seminal work of the Singapore Department of Statistics (2023), encapsulates data from the year 2023, a period marked by significant urban and demographic changes. While our initial rendition has been commended for its clarity in showcasing trends, it is the addition of interactive elements that promises a deeper engagement with the data. Nonetheless, we acknowledge room for improvement. Incorporating interactive toggles, expanded temporal ranges, and detailed geospatial mappings will refine our approach, providing a more comprehensive exploration of how urban planning influences demographic distribution and housing patterns in Singapore.
Figure 1: Visualization of Choropleth Map of Resident Population Density by the Department of Statistics Singapore (Singstat 2023)
2 Critical Assessment of the Original
The selected visualization from the Singapore Department of Statistics presents two variables: population distribution (quantitative) and planning areas (categorical). Additionally, the visualization includes a heatmap that allows users to delve into subzones to see patterns in land development and population over time.
However, there are some shortcomings the team has identified:
Complexity: While the visualization is thorough, its complexity may be overwhelming for some users, especially those unfamiliar with demographic data or geographical information systems (GIS).
Accessibility: The reliance on color and detailed graphics may pose accessibility challenges for users with visual impairments or those who are not tech-savvy.
Data Density: The high density of information in a single visualization can lead to cognitive overload, where important details might be overlooked due to the sheer volume of data presented (Front Psychol. 2023).
Limited Temporal Range: The visualization only covers data for the year 2023, which constrains the analysis to a narrow timeframe and does not allow for the examination of trends over time.
Lack of Customization: Users cannot customize the time range or select specific years of interest, limiting the depth of their analysis.
Static Elements: While some elements are interactive, the visualization could benefit from more dynamic features such as changing bubble sizes or color gradients to depict demographic shifts over time.
3 Proposed Improvements
We propose to address the shortcomings of the original visualization as follows:
Better contrast: Utilize high-contrast colors to improve accessibility for users with visual impairments, ensuring clarity and ease of interpretation for all users. One such example is the use of Color Universal Design (CUD) colors which are designed to be distinguishable by all users, including those with color vision deficiencies. (Okabe and Ito 2008)
Reduced Data Density: Simplify the presentation by reducing the number of data points displayed simultaneously, thus preventing overcrowding and making the visualization more comprehensible.
Interactive Elements: Hovering over a planning area will display a tooltip with detailed information on the population size and age profile of that region of Singapore.
Expanded Temporal Ranges: Introduce options for users to select specific time periods for analysis, facilitating a deeper exploration of trends over time.
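As a rough illustration of the "Better contrast" proposal, the sketch below pulls the Okabe-Ito (CUD) colors from base R. It assumes R 4.0.0 or later, where grDevices::palette.colors() offers a palette named "Okabe-Ito". The variable names cud_colors and age_fill are hypothetical, and dropping black and gray for fills is our assumption, not part of the original design.

```r
# Retrieve the Okabe-Ito (CUD) palette that ships with base R (R >= 4.0.0)
cud_colors <- grDevices::palette.colors(palette = "Okabe-Ito")

# The palette holds nine named hex colors, starting with black
print(cud_colors)

# Assumption: for fills we skip black (1st) and gray (9th) and keep the seven hues
age_fill <- unname(cud_colors[2:8])
print(age_fill)
```

A vector like age_fill could then be passed to a manual color scale, although seven hues will not cover every age group on their own.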
4 Data Cleaning
The Singapore Department of Statistics based its visualization on data collected by the Singapore Government for the year 2023, available in CSV format. The data includes the following columns: Planning Area, Population, Age Profile, and Gender (2023). For our improved visualization, we will use datasets covering 2000 to 2023. The Department of Statistics publishes the data in 10-year ranges (e.g. 2000-2010, 2011-2020), and years that do not yet make up a full 10-year range are provided in separate files. We therefore first combine the files into a single data frame and then clean it.
# Set the working directory to the root of the project folder
setwd(here::here())
# List all CSV files in the data folder (the pattern is a regular expression, so anchor it on the ".csv" extension)
file_list <- list.files("data", pattern = "\\.csv$")
# Print the list of files
print(file_list)
# Read and combine all CSV files
csv_list <- lapply(file_list, function(x) read.csv(file.path("data", x)))
df <- do.call(rbind, csv_list)
# Save the combined data frame to a new CSV file (optional if you want to view it before we start cleaning)
# write.csv(df, file.path("data", "combined_data.csv"), row.names = FALSE)
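Because rbind() requires every file to share the same columns, it can help to check this before combining. The following is a minimal, self-contained sketch of that check. It writes two tiny made-up CSV files to a temporary folder instead of touching the project's data folder, and the file names, values, and variable names are all hypothetical.

```r
# Create a fresh temporary folder holding two small, made-up CSV files
demo_dir <- tempfile("csv_demo_")
dir.create(demo_dir)
write.csv(data.frame(PA = "Ang Mo Kio", Pop = 140, Time = 2000),
          file.path(demo_dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(PA = "Yishun", Pop = 40, Time = 2023),
          file.path(demo_dir, "b.csv"), row.names = FALSE)

# List the CSV files with their full paths
demo_files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Read each file into a list of data frames
parts <- lapply(demo_files, read.csv, stringsAsFactors = FALSE)

# Check that every file has the same column names, in the same order
same_cols <- all(vapply(
  parts,
  function(p) identical(names(p), names(parts[[1]])),
  logical(1)
))
stopifnot(same_cols)

# Stack the files row-wise into one data frame
combined <- do.call(rbind, parts)
print(combined)
```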
Once the data has been combined, we inspect it with the head() and tail() functions to understand its structure and identify any missing values or inconsistencies.
# Display the first few rows of the combined data frame
head(df)
          PA        SZ Age     Sex Pop Time
1 Ang Mo Kio Cheng San   0   Males 140 2000
2 Ang Mo Kio Cheng San   0 Females 130 2000
3 Ang Mo Kio Cheng San   1   Males 180 2000
4 Ang Mo Kio Cheng San   1 Females 140 2000
5 Ang Mo Kio Cheng San   2   Males 160 2000
6 Ang Mo Kio Cheng San   2 Females 130 2000
# Display the last few rows of the combined data frame
tail(df)
            PA          SZ         Age     Sex Pop Time
1393751 Yishun Yishun West          88   Males  40 2023
1393752 Yishun Yishun West          88 Females  70 2023
1393753 Yishun Yishun West          89   Males  40 2023
1393754 Yishun Yishun West          89 Females  40 2023
1393755 Yishun Yishun West 90_and_Over   Males  70 2023
1393756 Yishun Yishun West 90_and_Over Females 200 2023
We can check the total number of rows in the data frame.
# Display the total number of rows in the data frame
nrow(df)
[1] 1393756
Based on the data type and structure, we will remove the columns that are not relevant to our visualization and clean the remaining columns to ensure consistency and accuracy. We will also aggregate the data to create a new data frame that consolidates the total population by age group, planning area, and time period.
# Load dplyr for %>%, select() and mutate()
library(dplyr)
# Deleting "SZ" and "Sex" columns as they are not relevant to our visualization
df <- df %>% select(-"SZ", -"Sex")
# Function to create 10-year age groups (0 to 9, 10 to 19, ...) from single years of age
create_age_group <- function(age) {
  if (age == "90_and_over" || age == "90_and_Over") {
    return("90 and Over")
  } else {
    age_num <- as.numeric(age)
    group_start <- (age_num %/% 10) * 10
    group_end <- group_start + 9
    return(paste0(group_start, " to ", group_end))
  }
}
# Mutating the age column into a new AgeGroup column
df <- df %>% mutate(AgeGroup = sapply(Age, create_age_group))
head(df)
          PA Age Pop Time AgeGroup
1 Ang Mo Kio   0 140 2000   0 to 9
2 Ang Mo Kio   0 130 2000   0 to 9
3 Ang Mo Kio   1 180 2000   0 to 9
4 Ang Mo Kio   1 140 2000   0 to 9
5 Ang Mo Kio   2 160 2000   0 to 9
6 Ang Mo Kio   2 130 2000   0 to 9
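create_age_group() is applied one element at a time with sapply(), which can be slow on a data frame with over a million rows. As a possible alternative, not the method used in this report, the sketch below does the same binning on a whole vector at once in base R. The function name age_to_group() and the sample ages are hypothetical.

```r
# Vectorised 10-year binning: takes a vector of ages and returns a vector of labels
age_to_group <- function(age) {
  age_chr <- as.character(age)
  # The open-ended top category appears as "90_and_over" or "90_and_Over"
  is_top <- tolower(age_chr) == "90_and_over"
  # Non-numeric entries (the top category) become NA here, which is expected
  age_num <- suppressWarnings(as.numeric(age_chr))
  group_start <- (age_num %/% 10) * 10
  ifelse(is_top, "90 and Over", paste0(group_start, " to ", group_start + 9))
}

# Made-up sample ages covering the edge cases
sample_ages <- c("0", "9", "10", "88", "90_and_Over")
sample_groups <- age_to_group(sample_ages)
print(sample_groups)
```

With this helper the mutate() step would reduce to AgeGroup = age_to_group(Age), without the sapply() call.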
Next, we clean the ‘Pop’ column by removing any non-numeric characters and converting it to a numeric data type for aggregation. This also makes the data frame more memory efficient, since storing the column as character data can use a lot of memory and slow down the workflow. One of the original datasets, collected in 2000, contains a group of residents whose Planning Area was not stated. We remove these rows as they are not relevant to our visualization. We then aggregate the data to calculate the total population by age group, planning area, and time period.
# Function to clean the 'Pop' column by removing non-numeric characters
clean_pop <- function(pop) {
  # Remove any non-digit characters
  clean_pop <- gsub("[^0-9]", "", pop)
  return(clean_pop)
}
# Function to drop any row containing "Not Stated" in the PA column as this is not relevant to our visualization
drop_not_stated <- function(df, column_name = "PA") {
  # Filter out rows where the specified column contains "Not Stated"
  df <- df %>% filter(!(.data[[column_name]] == "Not Stated"))
  return(df)
}
# Apply the drop_not_stated function to the dataframe
df <- drop_not_stated(df, "PA")
# Apply the cleaning function to the 'Pop' column
df$Pop <- sapply(df$Pop, clean_pop)
# Convert Pop to numeric for aggregation
df$Pop <- as.numeric(df$Pop)
# Move columns to a new data frame to consolidate total population, Age Group and Time
aggregated_df <- df %>%
  group_by(PA, AgeGroup, Time) %>%
  summarise(TotalPop = sum(Pop, na.rm = TRUE))
`summarise()` has grouped output by 'PA', 'AgeGroup'. You can override using
the `.groups` argument.
head(aggregated_df)
# A tibble: 6 × 4
# Groups:   PA, AgeGroup [1]
  PA         AgeGroup  Time TotalPop
  <chr>      <chr>    <int>    <dbl>
1 Ang Mo Kio 0 to 9    2000    21160
2 Ang Mo Kio 0 to 9    2001    19490
3 Ang Mo Kio 0 to 9    2002    18490
4 Ang Mo Kio 0 to 9    2003    17770
5 Ang Mo Kio 0 to 9    2004    17080
6 Ang Mo Kio 0 to 9    2005    16550
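To make the cleaning and aggregation steps easier to verify by hand, here is a small base R sketch that runs the same three steps on a made-up four-row data frame. It uses aggregate() rather than the dplyr pipeline above so that it runs without extra packages. The toy values, including the "1,300" entry with a thousands separator, are hypothetical and not taken from the real dataset.

```r
# Made-up toy data: Pop is stored as text, and one row has no stated planning area
toy <- data.frame(
  PA = c("Ang Mo Kio", "Ang Mo Kio", "Not Stated", "Yishun"),
  AgeGroup = c("0 to 9", "0 to 9", "0 to 9", "80 to 89"),
  Time = c(2000L, 2000L, 2000L, 2023L),
  Pop = c("140", "1,300", "50", "40"),
  stringsAsFactors = FALSE
)

# Step 1: drop rows whose planning area is "Not Stated"
toy <- toy[toy$PA != "Not Stated", ]

# Step 2: strip non-digit characters, then convert to numeric
# (note that this pattern would also strip decimal points and minus signs)
toy$Pop <- as.numeric(gsub("[^0-9]", "", toy$Pop))

# Step 3: total population per planning area, age group and year
toy_agg <- aggregate(Pop ~ PA + AgeGroup + Time, data = toy, FUN = sum)
names(toy_agg)[names(toy_agg) == "Pop"] <- "TotalPop"
print(toy_agg)
```

The two Ang Mo Kio rows collapse into a single row with a total of 1440, and the "Not Stated" row does not contribute to any total.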
We can save the cleaned data to a new CSV file for future use.
# Saved for future use
write_csv(aggregated_df, "cleaned_data.csv")
library(sf)
library(tmap)
# Read the KML file
singapore_kml <- st_read("singapore.kml")
Reading layer `ELD2020' from data source
`C:\Users\aloys\OneDrive\Documents\Year 1 Tri 3\AAI 1001 Data Engineering and Visualization\DEnVGrp3Proj\singapore.kml'
using driver `KML'
Simple feature collection with 31 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 103.6057 ymin: 1.158762 xmax: 104.0885 ymax: 1.470783
Geodetic CRS: WGS 84
# Inspect the KML file structure
str(singapore_kml)
Classes 'sf' and 'data.frame': 31 obs. of 3 variables:
$ Name : chr "RADIN MAS" "MOUNTBATTEN" "TANJONG PAGAR" "JALAN BESAR" ...
$ Description: chr "" "" "" "" ...
$ geometry :sfc_MULTIPOLYGON of length 31; first list element: List of 1
..$ :List of 1
.. ..$ : num [1:131, 1:2] 104 104 104 104 104 ...
..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
- attr(*, "sf_column")= chr "geometry"
- attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA
..- attr(*, "names")= chr [1:2] "Name" "Description"
head(singapore_kml)
Simple feature collection with 6 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 103.6911 ymin: 1.260668 xmax: 103.9203 ymax: 1.344064
Geodetic CRS: WGS 84
Name Description geometry
1 RADIN MAS MULTIPOLYGON (((103.8248 1....
2 MOUNTBATTEN MULTIPOLYGON (((103.9203 1....
3 TANJONG PAGAR MULTIPOLYGON (((103.8458 1....
4 JALAN BESAR MULTIPOLYGON (((103.8738 1....
5 MACPHERSON MULTIPOLYGON (((103.8818 1....
6 PIONEER MULTIPOLYGON (((103.7083 1....
# Set tmap mode to "view" for interactive maps
tmap_mode("view")
tmap mode set to interactive viewing
tmap_options(check.and.fix = TRUE)
tmap_options(max.categories = 31)
# Create the interactive map
tm_shape(singapore_kml) +
  tm_borders("blue", lwd = 1) +
  tm_fill(col = "Name", palette = "Set3", alpha = 0.5) + # Replace 'Name' with the appropriate column name
  tm_text("Name", size = 0.7) + # Replace 'Name' with the appropriate column name
  tm_view(bbox = st_bbox(singapore_kml)) # Zoom to the extent of the Singapore data
Warning: The shape singapore_kml is invalid. See sf::st_is_valid
singapore_kml
Simple feature collection with 31 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 103.6057 ymin: 1.158762 xmax: 104.0885 ymax: 1.470783
Geodetic CRS: WGS 84
First 10 features:
Name Description geometry
1 RADIN MAS MULTIPOLYGON (((103.8248 1....
2 MOUNTBATTEN MULTIPOLYGON (((103.9203 1....
3 TANJONG PAGAR MULTIPOLYGON (((103.8458 1....
4 JALAN BESAR MULTIPOLYGON (((103.8738 1....
5 MACPHERSON MULTIPOLYGON (((103.8818 1....
6 PIONEER MULTIPOLYGON (((103.7083 1....
7 POTONG PASIR MULTIPOLYGON (((103.889 1.3...
8 YUHUA MULTIPOLYGON (((103.7373 1....
9 BUKIT BATOK MULTIPOLYGON (((103.7484 1....
10 JURONG MULTIPOLYGON (((103.7373 1....
# Load the libraries
library(sf)
library(ggplot2)
# Define the path to your KML file (ensure this path is correct)
kml_file_path <- "singapore2.kml"
# Read the KML file
singapore_map <- st_read(kml_file_path)
Reading layer `singapore_Division_level_2' from data source
`C:\Users\aloys\OneDrive\Documents\Year 1 Tri 3\AAI 1001 Data Engineering and Visualization\DEnVGrp3Proj\singapore2.kml'
using driver `KML'
Simple feature collection with 10 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 103.6811 ymin: 1.254808 xmax: 104.0336 ymax: 1.416214
Geodetic CRS: WGS 84
# Inspect the KML data to understand its structure
print(head(singapore_map))
# Plot the map with ggplot2
ggplot(data = singapore_map) +
  geom_sf() +
  theme_minimal() +
  labs(title = "Planning Areas in Singapore", x = "Longitude", y = "Latitude")
# Load the libraries
library(leaflet)
library(sf)
library(lwgeom)
Warning: package 'lwgeom' was built under R version 4.4.1
Linking to liblwgeom 3.0.0beta1 r16016, GEOS 3.12.1, PROJ 9.3.1
Attaching package: 'lwgeom'
The following object is masked from 'package:sf':
st_perimeter
# Define the path to your KML file (ensure this path is correct)
kml_file_path <- "singapore.kml"
# Read the KML file
singapore_map <- st_read(kml_file_path)
Reading layer `ELD2020' from data source
`C:\Users\aloys\OneDrive\Documents\Year 1 Tri 3\AAI 1001 Data Engineering and Visualization\DEnVGrp3Proj\singapore.kml'
using driver `KML'
Simple feature collection with 31 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 103.6057 ymin: 1.158762 xmax: 104.0885 ymax: 1.470783
Geodetic CRS: WGS 84
# Clean and validate the geometries
singapore_map_clean <- st_make_valid(singapore_map)
# Optionally, check for and remove any empty geometries
singapore_map_clean <- singapore_map_clean[!st_is_empty(singapore_map_clean), ]
# Inspect the cleaned KML data to ensure it's valid
print(head(singapore_map_clean))
Simple feature collection with 6 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 103.6911 ymin: 1.260668 xmax: 103.9203 ymax: 1.344064
Geodetic CRS: WGS 84
Name Description geometry
1 RADIN MAS MULTIPOLYGON (((103.8248 1....
2 MOUNTBATTEN MULTIPOLYGON (((103.9203 1....
3 TANJONG PAGAR MULTIPOLYGON (((103.8458 1....
4 JALAN BESAR MULTIPOLYGON (((103.8738 1....
5 MACPHERSON MULTIPOLYGON (((103.8818 1....
6 PIONEER MULTIPOLYGON (((103.7083 1....
# Load the libraries
library(sf)
library(lwgeom)
library(plotly)
Warning: package 'plotly' was built under R version 4.4.1
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
# Define the path to your KML file (ensure this path is correct)
kml_file_path <- "singapore.kml"
# Read the KML file
singapore_map <- st_read(kml_file_path)
Reading layer `ELD2020' from data source
`C:\Users\aloys\OneDrive\Documents\Year 1 Tri 3\AAI 1001 Data Engineering and Visualization\DEnVGrp3Proj\singapore.kml'
using driver `KML'
Simple feature collection with 31 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 103.6057 ymin: 1.158762 xmax: 104.0885 ymax: 1.470783
Geodetic CRS: WGS 84
# Clean and validate the geometries
singapore_map_clean <- st_make_valid(singapore_map)
# Optionally, check for and remove any empty geometries
singapore_map_clean <- singapore_map_clean[!st_is_empty(singapore_map_clean), ]
# Inspect the cleaned KML data to ensure it's valid
print(head(singapore_map_clean))
Simple feature collection with 6 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 103.6911 ymin: 1.260668 xmax: 103.9203 ymax: 1.344064
Geodetic CRS: WGS 84
Name Description geometry
1 RADIN MAS MULTIPOLYGON (((103.8248 1....
2 MOUNTBATTEN MULTIPOLYGON (((103.9203 1....
3 TANJONG PAGAR MULTIPOLYGON (((103.8458 1....
4 JALAN BESAR MULTIPOLYGON (((103.8738 1....
5 MACPHERSON MULTIPOLYGON (((103.8818 1....
6 PIONEER MULTIPOLYGON (((103.7083 1....
# Create a Plotly map
plot_ly(singapore_map_clean) %>%
  add_sf(
    aes(fill = ~st_area(singapore_map_clean), text = ~paste("Planning Area: ", Name)),
    color = I("blue"),
    opacity = 0.5
  ) %>%
  layout(
    mapbox = list(
      style = "carto-positron",
      zoom = 10,
      center = list(lat = 1.3521, lon = 103.8198)
    ),
    margin = list(r = 0, l = 0, t = 0, b = 0)
  ) %>%
  add_annotations(
    text = "Area",
    x = 0.01,
    y = 0.95,
    xref = "paper",
    yref = "paper",
    showarrow = FALSE,
    font = list(size = 12, color = "black")
  ) %>%
  colorbar(
    title = "Area",
    len = 0.5,
    y = 0.5,
    x = 0.01,
    yref = "paper",
    xref = "paper"
  )
No trace type specified:
Based on info supplied, a 'scatter' trace seems appropriate.
Read more about this trace type -> https://plotly.com/r/reference/#scatter
Warning: Didn't find a colorbar to modify.
Next, we load the cleaned_data.csv file that was saved earlier.
# Load the cleaned data
cleaned_data <- read_csv("cleaned_data.csv")
Rows: 13190 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): PA, AgeGroup
dbl (2): Time, TotalPop
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Display the first few rows of the cleaned data
head(cleaned_data)
# A tibble: 6 × 4
  PA         AgeGroup  Time TotalPop
  <chr>      <chr>    <dbl>    <dbl>
1 Ang Mo Kio 0 to 9    2000    21160
2 Ang Mo Kio 0 to 9    2001    19490
3 Ang Mo Kio 0 to 9    2002    18490
4 Ang Mo Kio 0 to 9    2003    17770
5 Ang Mo Kio 0 to 9    2004    17080
6 Ang Mo Kio 0 to 9    2005    16550
5 Conclusion
The data is now ready for visualization. The next step is to create a plot that effectively communicates how the population density in each region of Singapore changes over time, and that lets curious readers explore the data further through interactivity. We will use the ggplot2 package to create the plot and plotly to add interactivity.
Okabe, M., & Ito, K. (2008). Color Universal Design (CUD): How to make figures and presentations that are friendly to Colorblind people. https://jfly.uni-koeln.de/color/